Databricks Templatized Transformation Jobs
Data transformation is the process of converting, cleansing, and structuring raw data into a usable format that can be analyzed to support decision making. It involves defining the structure, mapping the data, extracting the data from the source system, removing duplicates, converting data types, and enriching the dataset.
Lazsa Data Pipeline Studio (DPS) provides templates for creating transformation jobs. These jobs include join, union, and aggregate functions that group or combine data for analysis, as in the sketch below.
For complex operations on data, Lazsa DPS also provides the option of creating custom transformation jobs. You write the logic for custom queries yourself, while the DPS UI helps you build SQL queries by selecting specific columns of tables. Lazsa consumes these SQL queries, along with your transformation logic, to generate the code for custom transformation jobs.
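For illustration only, the following is a minimal PySpark sketch of the kind of join, union, and aggregate logic that a templatized transformation job performs. The paths, table names, and columns are hypothetical placeholders; this is not the code that Lazsa generates.

```python
# Minimal PySpark sketch of join / union / aggregate transformations.
# Paths, column names, and the SparkSession setup are hypothetical examples,
# not the code generated by Lazsa DPS.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("templatized-transformation-example").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/raw/orders/")        # hypothetical source folder
customers = spark.read.parquet("s3://example-bucket/raw/customers/")  # hypothetical source folder

# Join: combine orders with customer attributes on a shared key.
joined = orders.join(customers, on="customer_id", how="inner")

# Union: append rows from two folders that share the same schema.
orders_2023 = spark.read.parquet("s3://example-bucket/raw/orders_2023/")
all_orders = orders.unionByName(orders_2023)

# Aggregate: group and summarize the combined data.
summary = (
    joined.groupBy("customer_id")
          .agg(F.sum("order_amount").alias("total_amount"),
               F.count("order_id").alias("order_count"))
)

summary.write.mode("overwrite").parquet("s3://example-bucket/curated/order_summary/")
```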
To create a Databricks templatized transformation job
- Sign in to the Lazsa Platform and navigate to Products.
- Select a product and feature. Click the Develop stage of the feature and navigate to Data Pipeline Studio.
- Create a pipeline with the following nodes:
  Note: The stages and technologies used in this pipeline are merely an example.
  - Data Source - REST API
  - Data Integration - Databricks
  - Data Lake - Amazon S3
  - Data Transformation - Databricks
- In the data transformation stage, click the Databricks node and select Create Templatized Job to create a transformation job using the out-of-the-box template provided by the Lazsa Platform.
To create a templatized Databricks transformation job, complete the following steps:

Provide job details for the data transformation job:
- Template - Based on the source and destination that you choose in the data pipeline, the template is selected automatically.
- Job Name - Provide a name for the data transformation job.
- Node Rerun Attempts - The number of times a pipeline rerun is attempted on this node if the pipeline fails. The default is set at the pipeline level.
Configure the source node:

Review the following configuration details of the source node:
- Source - the selected source is displayed.
- Datastore - the selected datastore is displayed.
- Choose Source Format - the source format Parquet is preselected. Currently, Lazsa supports the Parquet format for Amazon S3.
- Choose Base Path - select the required folder from the folder structure in the drop-down. The source data is picked up for transformation from the selected path. If you select only one folder, you can perform only aggregation; to perform the join and union functions, you must select more than one folder (a sketch of the corresponding read follows this step).

Click Next.
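As a rough illustration of what the source-node settings translate to at run time, the sketch below reads Parquet data from base paths selected on Amazon S3. The bucket and folder names are placeholders, and the actual code generated by Lazsa may differ.

```python
# Hypothetical mapping of the source-node settings to a PySpark read.
# Bucket and folder names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Choose Source Format: Parquet (the format currently supported for Amazon S3).
# Choose Base Path: the folder(s) selected in the drop-down.
base_paths = [
    "s3://example-datastore/landing/orders/",     # a single folder allows aggregation only
    "s3://example-datastore/landing/customers/",  # a second folder enables join and union
]

# Read each selected base path into its own DataFrame for the transformation step.
dataframes = {path: spark.read.parquet(path) for path in base_paths}
```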
Select the options for transformation:
- Choose the join, union, or aggregate function that you want to perform on the data from the selected source folders.
Configure the target node:
- Source Datastore - the S3 datastore is already selected.
- Choose Target Format - select one of the following target formats:
  - Source Data Format - select this option if you want to maintain the same data format in the target as in the source.
  - Parquet - select this option if you want to use the Parquet format for the target data.
  - Delta Table - select this option if you want to create a table with delta data.
- Target Folder - select a target folder on S3.
- Target Path - provide a folder name that you want to append to the target folder. This is optional.
- Operation Type - choose the type of operation that you want to perform on the data files from the following options:
  - Append - add new data to the existing data.
  - Overwrite - replace the old data with new data.
- Enable Partitioning - enable this option if you want to use partitioning for the target data. Select from the following options:
  - Data Partition - select the filename and column details, enter the column value, and click Add.
  - Date Based Partitioning - select the type of partitioning that you want to use for the target data from the options Yearly, Monthly, or Daily. Optionally, add a prefix to the partition folder name.

You can review the final path of the target file, which is based on the inputs that you provide. A sketch of the corresponding write operation follows this list.
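To make the target-node options concrete, here is a hedged PySpark sketch of how a write with the chosen format, operation type, and partitioning might look. The DataFrame, bucket, paths, and partition column are assumptions, not Lazsa's generated code.

```python
# Hypothetical write of a transformed DataFrame according to the target-node
# options: target format, operation type, and optional partitioning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# result_df stands in for the output of the transformation step (join/union/aggregate).
result_df = spark.read.parquet("s3://example-datastore/staging/order_summary/")

# Target Folder plus the optional Target Path appended to it.
target_path = "s3://example-datastore/curated/order_summary/"

(
    result_df.write
             .format("delta")            # Choose Target Format: "delta" or "parquet"
             .mode("append")             # Operation Type: Append or Overwrite
             .partitionBy("order_date")  # Data Partition column (assumed); date-based
                                         # partitioning would use year/month/day folders
             .save(target_path)
)
```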
Enable the option if you want to publish the metadata related to the data on S3 to the AWS Glue metastore:
- Metadata Store - currently, Lazsa DPS supports AWS Glue.
- Select a configured AWS Glue Catalog from the drop-down. See Configuring AWS Glue.
- Database - the database name is populated based on your selection.
- Data Location - the location is created based on the selected S3 datastore and database.
- Select Entity - select an entity from the drop-down.
- Glue Table - either select an existing Glue table to which the metadata is added, or create a new Glue table for the metadata.
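To see roughly what publishing metadata to AWS Glue amounts to, the sketch below registers the target location as a table in a Glue-backed metastore using Spark SQL on Databricks. The database, table, and location names are placeholders; Lazsa performs this step for you when the option is enabled.

```python
# Hypothetical illustration of registering the target data as a Glue table.
# On a Databricks workspace whose metastore is AWS Glue, creating an external
# table this way makes it visible in the Glue Data Catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

database = "curated_db"                                     # Database (placeholder)
table = "order_summary"                                     # Glue Table (placeholder)
location = "s3://example-datastore/curated/order_summary/"  # Data Location (placeholder)

spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {database}.{table}
    USING PARQUET
    LOCATION '{location}'
    """
)
```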
To run the configured job, select the type of Databricks cluster from the following options:

| Option | Description |
| --- | --- |
| Cluster | Select the all-purpose cluster that you want to use. |

The cluster configuration includes the following settings (see the sketch after the table for how these typically map to a Databricks cluster specification):

| Section | Settings |
| --- | --- |
| Cluster Details | Choose Cluster, Job Configuration Name, Databricks Runtime Version, Worker Type, Workers, Enable Autoscaling |
| Cloud Infrastructure Details | First on Demand, Availability, Zone, Instance Profile ARN, EBS Volume Type, EBS Volume Count, EBS Volume Size |
| Additional Details | Spark Config, Environment Variables, Logging Path (DBFS Only), Init Scripts |
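For readers familiar with the Databricks Jobs API, the settings above correspond roughly to a job-cluster specification like the hedged sketch below. All values are placeholders, the mapping is approximate, and this is not the payload that Lazsa sends.

```python
# Hedged example of a Databricks job-cluster specification (Jobs API style),
# showing where the settings listed above typically appear. All values are
# placeholders.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",                 # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                         # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # Workers + Enable Autoscaling
    "aws_attributes": {                                  # Cloud Infrastructure Details
        "first_on_demand": 1,                            # First on Demand
        "availability": "SPOT_WITH_FALLBACK",            # Availability
        "zone_id": "us-east-1a",                         # Zone
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/example",  # Instance Profile ARN
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",        # EBS Volume Type
        "ebs_volume_count": 1,                           # EBS Volume Count
        "ebs_volume_size": 100,                          # EBS Volume Size (GB)
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},               # Spark Config
    "spark_env_vars": {"ENVIRONMENT": "dev"},                            # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}}, # Logging Path (DBFS Only)
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/install_libs.sh"}}],  # Init Scripts
}
```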
SQS and SNS - configure the following settings for the job: Configurations, Events, Event Details, and Additional Parameters.
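The section above lists only setting names. Assuming these settings configure notifications of job events to Amazon SNS and SQS, a hedged boto3 sketch of such a notification might look like the following; the topic ARN, queue URL, and event payload are placeholders, and the Lazsa Platform handles the actual publishing based on this configuration.

```python
# Hypothetical illustration of sending a job-event notification to SNS and SQS.
# Topic ARN, queue URL, and the event payload are placeholders.
import json
import boto3

event = {
    "job": "databricks-templatized-transformation",
    "status": "SUCCEEDED",                     # Event Details (assumed)
    "additional_parameters": {"env": "dev"},   # Additional Parameters (assumed)
}

sns = boto3.client("sns")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:example-topic",
    Message=json.dumps(event),
    Subject="Data transformation job event",
)

sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    MessageBody=json.dumps(event),
)
```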